-
Notifications
You must be signed in to change notification settings - Fork 227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add retry limiter to backoff function #1478
base: master
Are you sure you want to change the base?
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@Tema It's ok to introduce a rate-limiting mechanism for the kv-client. Finding the optimal balance between error handling, retry success, and avoiding overloading PD could be challenging, or selecting a suitable default value that works for most scenarios. Another approach is similar to TiKV health control feedback (as discussed in tikv/tikv#16297), where some processing capacity information is carried in PD responses and fed back to the KV client. Based on this feedback, the kv-client can then decide its concurrency control and rate-limiting strategy accordingly. |
Thanks cfzjywxk for the comment. I think it is not always possible for PD to reply to provide this information to TiDB in case it is completely overloaded. tikv/pd#8678 proposes a more sophisticated solution to cover that case as well. Maybe it is worth to see if it could be incorporated into tikv/tikv#16297 which you mentioned. |
@@ -96,6 +98,50 @@ func NewConfig(name string, metric *prometheus.Observer, backoffFnCfg *BackoffFn | |||
} | |||
} | |||
|
|||
type RetryRateLimiter struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need comments like for exported functions and type definitions,
// RetryRateLimiter is used to limit retry times for PD requests.
or the lint check would fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
somehow golangci-lint run
didn't complain. Fixed each comment
config/retry/config.go
Outdated
cap int32 | ||
} | ||
|
||
func NewRetryRateLimiter(cap int32, ratio float32) *RetryRateLimiter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto for the comments, and better to explain the meaning of the input parameters.
Besides, would it be less expensive using int
or uint
type for ratio
instead of float values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason it is float that the ration usually less than 1. E.g. 1 retry for each 10 success meant 0.1. In the last commit I've inverted it to be successPerRetryCount, so now can use int. I chose int
over uint
as there is no rand.uint
version.
config/retry/config.go
Outdated
} | ||
} | ||
|
||
// add a token to the rate limiter bucket according to configured retry to success ratio and cap |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment format is like
// addRetryToken is a ...
, needs to start with the function name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed
config/retry/config.go
Outdated
|
||
// add a token to the rate limiter bucket according to configured retry to success ratio and cap | ||
func (r *RetryRateLimiter) addRetryToken() { | ||
if rand.Float32() < r.allowedRetryToSuccessRatio { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As metioned above, would it be less expensive to use integer random values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed
return false | ||
} | ||
|
||
func NewConfigWithRetryLimit(name string, metric *prometheus.Observer, backoffFnCfg *BackoffFnCfg, retryRateLimiter *RetryRateLimiter, err error) *Config { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto for the comments of exported function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed
config/retry/config.go
Outdated
BoTiFlashRPC = NewConfig("tiflashRPC", &metrics.BackoffHistogramRPC, NewBackoffFnCfg(100, 2000, EqualJitter), tikverr.ErrTiFlashServerTimeout) | ||
BoTxnLock = NewConfig("txnLock", &metrics.BackoffHistogramLock, NewBackoffFnCfg(100, 3000, EqualJitter), tikverr.ErrResolveLockTimeout) | ||
BoPDRPC = NewConfig("pdRPC", &metrics.BackoffHistogramPD, NewBackoffFnCfg(500, 3000, EqualJitter), tikverr.NewErrPDServerTimeout("")) | ||
BoPDRegionMetadata = NewConfigWithRetryLimit("pdRegionMetadata", &metrics.BackoffHistogramPD, NewBackoffFnCfg(500, 3000, EqualJitter), NewRetryRateLimiter(10, 0.1), tikverr.NewErrPDServerTimeout("")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed in the previous comments, it would be challendge to choose a default value for all kinds of scenarios? Do we have some tests for the choice of 10, 0.1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, choosing the config value is tricky. 10% retry to success seems a reasonable limit. 10 as a retry cap does not matter much as in steady state it will still be proportional to success rate and just limit the bursts.
Maybe we need to look at a way to allow to configure it per type at tidb-server, then we can disable by default and give a way to enable it through configuration with specific values. But I don't know what could be a good config for that. Sysvars and toml would be messy for that. Maybe some system table could be a good interface for that, but I haven't seen such approach at TiDB yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cfzjywxk @niubell how about we add very conservative config, say (cap: 1000, success/retry: 1). Basically for each success we allow one more retry on average with very relaxed bursts of 1000 retries and use it just for loadRegion(s). This way we unlikely affect any steady state, but still limit QPS spike due to retries at most to 2x of steady state.
Signed-off-by: artem_danilov <artem_danilov@airbnb.com>
Signed-off-by: artem_danilov <artem_danilov@airbnb.com>
Signed-off-by: artem_danilov <artem_danilov@airbnb.com>
Current backoff policy does not help much to limit TiDB reties to retrieve Region from PD when there are issues with Region Metadata in PD:
This PR adds an ability to configure global retry limiter to Backoff function per each config. It also creates a new Backoff config dedicated to PD Region Metadata calls which will be used in TiDB in separate PR:
The config above allows a single retry per each 10 previous successful call (0.1), but limit overall retry budget to 10. It always start with full budget of retries.